A Corpus-Based Statistical Approach to Automatic Book Indexing
نویسندگان
چکیده
The paper reports on a new approach to automatic generation of back-of-book indexes for Chinese books. Parsing on the level of complete sentential analysis is avoided because of the inefficiency and unavailability of a Chinese Grammar with enough coverage. Instead, fundamental analysis particular to Chinese text called word segmentation is performed to break up characters into a sequence of lexical units equivalent to words in English. The sequence of words then goes through part-ofspeech tagging and noun phrase analysis. All these analyses are done using a corpus-based statistical algorithm. Experimental results have shown satisfactory results.
منابع مشابه
Automatic keyword extraction using Latent Dirichlet Allocation topic modeling: Similarity with golden standard and users' evaluation
Purpose: This study investigates the automatic keyword extraction from the table of contents of Persian e-books in the field of science using LDA topic modeling, evaluating their similarity with golden standard, and users' viewpoints of the model keywords. Methodology: This is a mixed text-mining research in which LDA topic modeling is used to extract keywords from the table of contents of sci...
متن کاملBriefly Noted
With the explosion in the quantity of online information in recent years, automatic abstracting and indexing has received renewed interest and a number of promising approaches have emerged. The goal of this book is to present a complete description of current indexing and abstracting techniques in the context of the underlying linguistic and statistical knowledge. The book has three parts: the ...
متن کاملCorpus-Based Learning Of Compound Noun Indexing
In this paper, we present a corpusbased learning method that can index diverse types of compound nouns using rules automatically extracted from a large tagged corpus. We develop an e cient way of extracting the compound noun indexing rules automatically and perform extensive experiments to evaluate our indexing rules. The automatic learning method shows about the same performance compared with ...
متن کاملn-grams of Seeds: A Hybrid System for Corpus-Based Text Summarization
This paper presents a hybrid system for automatic text summarization which combines statistical and knowledge-based methods. In particular, it demonstrates how two corpus-based learning and indexing algorithms, namely an n-gram and a seed-oriented approach, may be combined to bring out the best of both approaches. This system selects sentences from an input text to constract a highly compressed...
متن کاملNoun-Phrase Analysis in Unrestricted Text for Information Retrieval
Information retrieval is an important application area of natural-language processing where one encounters the genuine challenge of processing large quantities of unrestricted natural-language text. This paper reports on the application of a few simple, yet robust and efficient nounphrase analysis techniques to create better indexing phrases for information retrieval. In particular, we describe...
متن کامل